ABSTRACT

A collaborative framework for detecting the different sources in 
mixed signals is presented in this paper. The approach is based on C- 
HiLasso, a convex collaborative hierarchical sparse model, and pro- 
ceeds as follows. First, we build a structured dictionary for mixed 
signals by concatenating a set of sub-dictionaries, each one of them 
learned to sparsely model one of a set of possible classes. Then, 
the coding of the mixed signal is performed by efficiently solving a 
convex optimization problem that combines standard sparsity with 
group and collaborative sparsity. The present sources are identified 
by looking at the sub-dictionaries automatically selected in the cod- 
ing. The collaborative filtering in C-HiLasso takes advantage of the 
temporal/spatial redundancy in the mixed signals, letting collections 
of samples collaborate in identifying the classes, while allowing in- 
dividual samples to have different internal sparse representations. 
This collaboration is critical to further stabilize the sparse repre- 
sentation of signals, in particular the class/sub-dictionary selection. 
The internal sparsity inside the sub-dictionaries, as naturally incor- 
porated by the hierarchical aspects of C-HiLasso, is critical to make 
the model consistent with the essence of the sub-dictionaries that 
have been trained for sparse representation of each individual class. 
We present applications from speaker and instrument identification 
and texture separation. In the case of audio signals, we use sparse 
modeling to describe the short-term power spectrum envelopes of 
harmonic sounds. The proposed pitch-independent method automatically
detects the number of sources in a recording.

1. INTRODUCTION AND MOTIVATION 

Sparse signal modeling has been shown to lead to numerous state- 
of-the-art results in signal processing, in addition to being very at- 
tractive at the theoretical level. The standard model assumes that a 
signal can be efficiently represented by a sparse linear combination 
of atoms from a given or learned dictionary. The selected atoms form 
the active set, whose cardinality is significantly smaller than the size 
of the dictionary and the dimension of the signal. Adding struc- 
tural constraints to this active set has value both at the level of rep- 
resentation robustness and at the level of signal interpretation; e.g., 
[1, 2, 3]. This leads to group or structured sparse coding, where the atoms
are grouped and only a few groups are active at a time. An alternative
way to add structure (and robustness) to the problem is to consider 
the simultaneous and collaborative encoding of multiple signals,
requiring that they all share the same active set; e.g., [4, 5, 6].

In the (linear) source separation problem, an observed signal is 
assumed to be a linear superposition (mixture) of several sources, 
and the primary task is to estimate from it each of the unmixed 
sources. If the task is only to identify the active sources, the problem 
is called source identification. In this case, since the original sources 
do not need to be recovered, the modeling can be done in terms of 
features extracted from the original signals in a non-bijective way. 
We propose to first use traditional sparse modeling tools to learn a
dictionary for each one of the Q possible classes. Concatenating 
these dictionaries, $D = [D_1 | D_2 | \dots | D_Q]$, any mixture signal produced
by that ensemble will be accurately represented as a sparse
linear combination of the atoms of this larger dictionary D. In this 
case one expects the resulting sparsity patterns to have a particular 
structure, with sub-dictionaries active following the classes present 
in the mixture. In addition, the time correlation in audio signals and 
the spatial correlation in images suggest that neighboring samples are
strongly correlated, and that this correlation should be exploited.
Consider for example a piece of music, where a few out of many po- 
tential instruments are playing simultaneously at different times. For 
small time windows, we can assume that the same few instruments 
are playing at all instants (each instant represents a mixture sample), 
so that the corresponding same few groups or sub-dictionaries will 
be active in all samples. However, we do not expect the sound produced
by each instrument to be the same at each instant, so the internal
activation per sub-dictionary will be sample dependent.
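
The structured dictionary described above can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: the function name `build_structured_dictionary` is hypothetical, and any matrices with a common number of rows can stand in for the learned sub-dictionaries.

```python
import numpy as np

def build_structured_dictionary(sub_dicts):
    """Concatenate sub-dictionaries column-wise and record each one's atom indices."""
    D = np.hstack(sub_dicts)
    groups, start = [], 0
    for Dq in sub_dicts:
        # Atoms of this class occupy a contiguous block of columns of D.
        groups.append(np.arange(start, start + Dq.shape[1]))
        start += Dq.shape[1]
    return D, groups
```

The list `groups` keeps track of which atoms belong to which class, so that the classes present in a mixture can later be read off from the groups of coefficients that are active.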

We propose to use the Collaborative Hierarchical Lasso (C- 
HiLasso) model that combines the benefits of structured and collab- 
orative sparse coding in a hierarchical sparse model, with sparsity 
both at the group and single coefficient levels, and where multiple 
signal samples/instances collaborate in the recovery of their com- 
mon active groups. However, for each signal, the active atoms within 
the shared active groups are particular to that signal realization. The 
internal sparsity inside the blocks, which is not present in standard 
structured/block sparsity models, is critical, since each block corresponds
to a sub-dictionary $D_G$ learned to efficiently represent signals
of one of the possible classes in a sparse coding fashion. Not
considering such in-block sparsity will then be contradictory to the 
essence of the dictionary model, while not considering sparsity and 
collaboration at the block level will contradict the fact that only a 
few classes are active per instance of the signal, and such classes 
are shared, e.g., in audio as explained above. Previously proposed 
sparsity models, e.g., group or collaborative sparsity, or the elastic net
[7], do not have these characteristics, which are critical for the problem
at hand and consistent with realistic assumptions about the
signal. In [8] we provide additional details and variations of the proposed
model, including a detailed comparison of C-HiLasso with
recent literature, theoretical results regarding recovery guarantees,
and efficient optimization techniques that ensure convergence to
the global optimum. The goal of this work is to show how this
framework can be successfully applied to several types of signals by
appropriately selecting the features.

In Section 2 we briefly describe the C-HiLasso model. In Section 3
we address the problem of single-channel speaker and instrument
identification. Feature selection is crucial for the success
and the efficiency of the model. The proposed method uses the spectral
envelope as the feature vector and does not require estimating
the fundamental frequency of the sources. In Section 4 we address
the problem of texture separation and identification.



2. COLLABORATIVE HIERARCHICAL SPARSE CODING 

We have a set of data samples $x_j \in \mathbb{R}^m$, $j = 1, \dots, n$, and a dictionary
of $p$ atoms in $\mathbb{R}^m$, assembled as a matrix $D \in \mathbb{R}^{m \times p}$. Each
sample $x_j$ can be written as $x_j = D a_j + \epsilon$, $a_j \in \mathbb{R}^p$, $\epsilon \in \mathbb{R}^m$,
$\|\epsilon\|_2 \ll \|x_j\|_2$. The underlying assumption in sparse modeling
is that the "optimal" reconstruction $a_j$ has only a few nonzero elements.
The convex formulation of this representation, known in the
literature as Lasso [9], can be efficiently solved using general purpose
or specialized optimization techniques. A popular variant is the
unconstrained version,

$$\min_{a} \frac{1}{2} \|x_j - D a\|_2^2 + \lambda \|a\|_1, \tag{2.1}$$

where $\lambda$ is a regularization parameter, usually found by cross-validation.
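
To make the unconstrained Lasso concrete, the following is a minimal sketch of one standard way to solve it, the iterative shrinkage-thresholding algorithm (ISTA). It is given only as an illustration; the text does not state which solver is used, and the function names and the iteration count are assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(x, D, lam, n_iter=500):
    """Approximately minimize 0.5 * ||x - D a||_2^2 + lam * ||a||_1 with ISTA."""
    # Step size 1/L, where L = ||D||_2^2 is the Lipschitz constant of the gradient.
    step = 1.0 / np.linalg.norm(D, 2) ** 2
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)          # gradient of the quadratic data term
        a = soft_threshold(a - step * grad, step * lam)
    return a
```

When $D$ is the identity, the minimizer is simply the soft-thresholded signal, which gives a quick sanity check of the routine.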

In many situations, one has prior knowledge that certain groups 
of atoms are simultaneously selected in the coding. Designing a 
model that takes this into account naturally leads to a better result. 
Suppose that a dictionary of $p$ atoms is divided into $Q$ groups.^1 We
refer to the sub-dictionary of atoms of $D$ belonging to a group $G$ as
$D_G$, and to the corresponding set of linear reconstruction coefficients
as $a_G$. The Group Lasso problem is [1],

$$\min_{a} \frac{1}{2} \|x_j - D a\|_2^2 + \lambda \sum_{G=1}^{Q} \|a_G\|_2. \tag{2.2}$$

Note that (2.2) reduces to (2.1) when the groups contain only one
atom, and its effect on the groups of a is a natural generalization of 
Lasso: it turns on/off coefficients in groups. 
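
The on/off effect of the group regularizer can be seen in its proximal operator, sketched below. This is an illustrative assumption about how one might implement the shrinkage step, not a description of the solver used in the paper; `groups` is a hypothetical list of index lists, one per group.

```python
import numpy as np

def group_soft_threshold(a, groups, t):
    """Shrink each group of coefficients toward zero by t in Euclidean norm.

    A group whose norm does not exceed t is set to zero as a whole;
    otherwise all its coefficients are scaled by the same factor.
    """
    out = np.zeros_like(a, dtype=float)
    for idx in groups:
        norm = np.linalg.norm(a[idx])
        if norm > t:
            out[idx] = (1.0 - t / norm) * a[idx]
    return out
```

With groups of a single atom this reduces to the usual soft-thresholding of Lasso, in agreement with the remark above.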

In numerous applications, one expects that certain collections of
samples, $X = [x_1, \dots, x_n] \in \mathbb{R}^{m \times n}$, share the same active components
from the dictionary, that is, the indexes of the corresponding
nonzero coefficients, $A = [a_1, \dots, a_n] \in \mathbb{R}^{p \times n}$, are the same
for all the samples in the collection. Imposing such dependency in
the $\ell_1$ regularized regression problem gives rise to the so-called collaborative
(also called "multitask" or "simultaneous") sparse coding
problem [10]. The model is given by

$$\min_{A} \frac{1}{2} \|X - D A\|_F^2 + \lambda \sum_{k=1}^{p} \|a^k\|_2, \tag{2.3}$$

where $a^k \in \mathbb{R}^n$ is the $k$-th row of $A$, that is, the vector of the $n$
different values that the coefficient associated to the $k$-th atom takes
for each sample $j = 1, \dots, n$. If we extend this idea to the Group
Lasso, we obtain a collaborative Group Lasso (C-GLasso), 

$$\min_{A} \frac{1}{2} \|X - D A\|_F^2 + \lambda \sum_{G=1}^{Q} \|A^G\|_F, \tag{2.4}$$

where $A^G$ is the sub-matrix formed by all the rows belonging to
group G. 

^1 For simplicity we assume that all the groups have the same size. The
extension to the general case is straightforward, see [1] for details.

As explained in Section 1, in our proposed strategy for performing
source separation, each $D_G$ is trained to sparsely represent
one of the possible sources in the mixture. Hence, the sparsity pattern
of the coefficients of the mixture signals is hierarchical: sparsity
at the group and atom levels. In this situation the C-GLasso
would fail to recover the true sparsity pattern, since it promotes all
the atoms in the sub-dictionary to be selected simultaneously. We
also need to consider collaboration at the group level, as in (2.4), but
not at the individual atoms level. The only alternative that can handle
all these requirements simultaneously is C-HiLasso (see [8] for
details on how to automatically set the regularization parameters $\lambda_1$ and $\lambda_2$,
and also details on the optimization),

$$\min_{A} \frac{1}{2} \|X - D A\|_F^2 + \lambda_2 \sum_{G=1}^{Q} \|A^G\|_F + \lambda_1 \sum_{j=1}^{n} \|a_j\|_1. \tag{2.5}$$

The regularizer in (2.5) is a combination of the ones used in C-GLasso
and Lasso and as such encourages the signals to share the
same groups (classes), while the active set inside each group is signal
dependent. Note that the last term in (2.5) can be replaced by a group
sparsifying norm, e.g., if atoms in the sub-dictionary have some correlation.
Previous approaches have only considered particular cases,
such as structured coding [2], hierarchy without collaboration [11],
or collaboration without hierarchy [12, 13]. The comprehensive new
model [8] is needed for the important applications presented next.
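
As an illustration of how the two regularizers in the C-HiLasso objective interact, the sketch below applies plain proximal-gradient iterations to it. This is not the optimization method of [8], which is not described here; it is a generic scheme under the assumption that the proximal step can be computed as elementwise soft-thresholding followed by block shrinkage of each group, a known property of this combination of norms. All function and argument names are hypothetical.

```python
import numpy as np

def prox_chilasso(A, groups, t1, t2):
    """Proximal step for t2 * sum_G ||A^G||_F + t1 * sum_j ||a_j||_1.

    `groups` is a list of row-index lists, one per sub-dictionary.
    """
    # In-group sparsity: elementwise soft-thresholding.
    B = np.sign(A) * np.maximum(np.abs(A) - t1, 0.0)
    out = np.zeros_like(B)
    for rows in groups:
        norm = np.linalg.norm(B[rows])    # Frobenius norm of the group block
        if norm > t2:                      # otherwise the whole group is switched off
            out[rows] = (1.0 - t2 / norm) * B[rows]
    return out

def chilasso_ista(X, D, groups, lam1, lam2, n_iter=300):
    """Proximal-gradient iterations on the C-HiLasso objective."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2
    A = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(n_iter):
        grad = D.T @ (D @ A - X)
        A = prox_chilasso(A - step * grad, groups, step * lam1, step * lam2)
    return A
```

In the output, entire groups of rows are zero for all samples (collaboration at the group level), while the nonzero pattern inside an active group may differ from one column to the next.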

3. SOURCE IDENTIFICATION IN AUDIO 

Source identification is a classic problem in audio analysis, see [14,
15] and references therein. Here it is addressed with the C-HiLasso
model.

3.1. Feature Selection for Speaker Identification 

A challenging aspect when identifying audio sources is to obtain fea- 
tures that are specific to each source and at the same time invariant 
to changes in the fundamental frequency (tone) of the sources. In the 
case of speech, sounds can be divided into two main groups, voiced 
and unvoiced sounds. Of the two, only the former contains informa- 
tion useful for identifying the speaker. Since unvoiced sounds have 
much less energy than voiced ones, we can easily remove them from 
the feature extraction process, so that the identification is performed 
solely with the voiced sounds. To describe voiced sounds, we use 
their short-term power spectrum envelopes (SE) as feature vectors, 
which is a common choice in speaker recognition tasks [16].

The SE in human speech varies over time, producing different
patterns for each phoneme. Thus, a speaker does not produce a
unique SE for voiced sounds, but a set that lives in a union of manifolds.
Since such manifolds are well represented by sparse models,
the SE characteristics are well suited for the sparse modeling framework.
For C-HiLasso, the feature extraction process needs to be
linear, and extracting the SE is not a linear operation. To overcome
this, we propose a method inspired by the Mel-Frequency Cepstral
Coefficients (MFCC) technique [16], exploiting the harmonic properties
of voiced sounds.

Assume that we observe a signal $y(n)$ that is a linear mixture of
$c$ harmonic sources, where $\alpha_i$ are the mixing coefficients, $E_i$ and
$f_i$ are the SE and fundamental frequency of the $i$-th source
respectively, $f_s$ is the sampling frequency, and each of the $K$
harmonics (or partials) of that source has its own phase. As with MFCC,
we start the analysis of an audio window by performing a short-term
Fourier transform (STFT) on it. Since phase information is irrelevant
for computing the spectral envelope, we only keep the magnitude of the
obtained STFT. In order
to amplify the frequency range of interest, we apply an emphasis
window $W(\theta)$ to the STFT magnitude, obtaining the emphasized magnitude
spectrum (3.6), in which each partial appears as a Dirac distribution $\delta$.^2
Finally, we perform a discrete cosine transform (DCT) on the emphasized
STFT magnitude (which is a real function), obtaining, for each source,
the convolution of a low-frequency lobe that approximates the spectrum of
the emphasized SE with a sequence of spikes corresponding to
the fundamental frequency and its partials. Keeping only the low-frequency
coefficients of the DCT of (3.6), we obtain
$\sum_{i=1}^{c} \alpha_i \, \mathrm{DCT}\{W(\theta) E_i(\theta)\}$, which is a linear combination of the
spectra of the emphasized SEs of the sources present. We have thus
obtained a linear relation between the sources and their spectral envelopes.
The feature extraction process is summarized in Figure 1.

Fig. 1. Feature extraction for audio signals. From left to right: sample
analysis window, its STFT magnitude and emphasis curve, emphasized STFT
magnitude, DCT and low frequency samples used as features (dotted).
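
For reference, one harmonic mixture model and one emphasized magnitude spectrum consistent with the symbols defined in this section can be written as below. These expressions are a sketch under assumptions, not necessarily the exact equations intended here: the phase symbol $\phi_{i,k}$, the notation $|Y(\theta)|$ for the STFT magnitude, and the normalized frequency argument $2\pi k f_i / f_s$ are introduced only for illustration.

```latex
y(n) = \sum_{i=1}^{c} \alpha_i \sum_{k=1}^{K}
       E_i\!\left(\tfrac{2\pi k f_i}{f_s}\right)
       \cos\!\left(\tfrac{2\pi k f_i}{f_s}\, n + \phi_{i,k}\right),
\qquad
W(\theta)\,|Y(\theta)| \approx \sum_{i=1}^{c} \alpha_i\, W(\theta)\, E_i(\theta)
       \sum_{k=1}^{K} \delta\!\left(\theta - \tfrac{2\pi k f_i}{f_s}\right).
```

Under this form, the DCT of each term is the convolution of the transform of $W(\theta) E_i(\theta)$ with that of the spike train, and keeping the low-frequency coefficients retains the part that depends on the emphasized envelope.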

The audio files were re-sampled at $f_s = 16$ kHz. The features
were computed from the STFT with a frame length of 512 samples
and an overlap of 75% using a Hanning window. The emphasis
in frequency is $E(f) = 1 + a f$, $a = 2/f_s$. After the DCT, the
lowest 60 coefficients form the features.
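
The feature extraction with the parameters listed above can be sketched as follows. This is a minimal illustration and not the authors' code: the function names are hypothetical, the DCT is written out explicitly as an unnormalized DCT-II (a library routine such as `scipy.fft.dct` could be used instead, up to scaling), and details such as the removal of unvoiced frames are omitted.

```python
import numpy as np

def dct2(v):
    """Unnormalized DCT-II of a 1-D array, written out explicitly."""
    n = len(v)
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    return np.cos(np.pi * k * (2 * m + 1) / (2 * n)) @ v

def envelope_features(signal, fs=16000, frame_len=512, overlap=0.75, n_coef=60):
    """Return one feature vector (column) per analysis frame."""
    hop = int(frame_len * (1 - overlap))            # 128 samples for 75% overlap
    window = np.hanning(frame_len)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)  # positive frequencies in Hz
    emphasis = 1.0 + (2.0 / fs) * freqs             # E(f) = 1 + a f, a = 2 / fs
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        mag = np.abs(np.fft.rfft(frame))            # phase is discarded
        feats.append(dct2(emphasis * mag)[:n_coef]) # keep the lowest coefficients
    return np.array(feats).T
```

Each column of the returned matrix plays the role of one sample $x_j$ in the models of Section 2.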

3.2. Speaker Identification via C-HiLasso 

The data for this case consists of six-minute-long recordings of five
different German radio speakers, two female and three male.^3 One
quarter of the samples are used for training and the rest for testing.

First we want to ensure that the proposed features and the sparse 
modeling framework are well suited for this application. We analyzed
the dataset using two very simple classifiers: $k$-nearest neighbors,
and the classifier proposed in [17]. In the latter case, following
standard dictionary learning techniques, a dictionary is learned
for each class using the corresponding training samples. Each normalized
testing sample, $x_j$, is then assigned to the class for which
the risk $R(D_i, x_j) = \min_{a_j} \frac{1}{2} \|x_j - D_i a_j\|_2^2 + \lambda \|a_j\|_1$ is minimized
($\lambda$ is learned via cross-validation). The error rate obtained for each
speaker by each method is shown in Figure 2 (left). Although a per-sample
error rate of 17.0% is not small, each sample corresponds
to a time window of 32ms and we can safely assume that the same 
speaker will be active during several consecutive samples. Thus a 
simple voting scheme on top of any of these classifiers would re- 
duce the error significantly. The collaboration, naturally included in 
C-HiLasso, is crucial for the identification, also when mixtures are 
present as detailed next. 
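
The minimum-risk classification and the voting idea mentioned above can be sketched as follows. This is only an illustration under assumptions: the risk is estimated with a few ISTA iterations rather than with the solver used in the experiments, and the function names and the window width of the vote are hypothetical.

```python
from collections import Counter
import numpy as np

def sparse_risk(x, D, lam, n_iter=200):
    """Value of min_a 0.5 * ||x - D a||_2^2 + lam * ||a||_1, estimated with ISTA."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = a - step * (D.T @ (D @ a - x))
        a = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return 0.5 * np.sum((x - D @ a) ** 2) + lam * np.sum(np.abs(a))

def classify(x, sub_dicts, lam):
    """Index of the sub-dictionary that yields the smallest risk for sample x."""
    return int(np.argmin([sparse_risk(x, D, lam) for D in sub_dicts]))

def majority_vote(labels, width=5):
    """Replace each per-sample label by the most common label in a centered window."""
    half = width // 2
    smoothed = []
    for i in range(len(labels)):
        window = labels[max(0, i - half): i + half + 1]
        smoothed.append(Counter(window).most_common(1)[0][0])
    return smoothed
```

Because consecutive 32 ms samples usually come from the same speaker, isolated misclassifications are outvoted by their neighbors.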

In the analysis above we assumed the strong hypothesis that only 
one speaker is active at a time. We now relax this and test the per- 
formance of C-HiLasso in identifying speakers in mixture signals. 
Clearly, k nearest neighbors can't be used in this case. For each 
speaker, a sub-dictionary of 90 atoms was learned from the train- 
ing dataset (we observed that the exact dictionary size is not critical 
to the results of the algorithm). We extracted 10 non-overlapping
frames of 15 seconds each, and encoded them using C-HiLasso.
The experiment was repeated for all possible combinations of two 
speakers, and all the speakers talking alone. In order to quantify the 
performance we measured the Hamming distance between the de- 
tected active sources and the ground truth. We compared the results 
only against the Lasso algorithm. This is the canonical experiment 
since the dictionaries were trained to sparsely represent the data (C-GLasso
assumes that all the atoms in the sub-dictionary are active
simultaneously). C-HiLasso obtained a Hamming distance of 0.053, 
showing a very good capability of automatically detecting the active
sources (speakers) in each frame without having the number of active
sources as prior knowledge. Lasso gives a Hamming distance of
3.33. For C-HiLasso, the Hamming distance is 0.08 when there is only
one speaker in the mixture signal and 0.04 when there are two. Again,
Lasso performs worse here, giving 4 and 3 respectively.
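
The evaluation measure can be sketched as follows, under the assumption that the Hamming distance counts the classes whose detected activity differs from the ground truth. The function names and the tolerance used to decide whether a group is active are hypothetical choices made for illustration.

```python
import numpy as np

def active_groups(A, groups, tol=1e-8):
    """Binary indicator per group: 1 if its block of coefficients is nonzero."""
    return np.array([int(np.linalg.norm(A[rows]) > tol) for rows in groups])

def hamming_distance(detected, truth):
    """Number of classes whose detected activity differs from the ground truth."""
    return int(np.sum(np.asarray(detected) != np.asarray(truth)))
```

For example, with five classes, detecting classes 1 and 3 when classes 1 and 2 are present gives a distance of 2: one missed source and one false detection.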

In Figure 2 we show the results obtained for each frame. One
could think of adding robustness to this method by evaluating over- 
lapping time frames and doing time regularization. 



3.3. Instrument Identification 

Unlike the case of the human voice where the fundamental frequency 
can vary over a small range of values, in musical instruments it can 
vary considerably from one instrument to the other. For example in 
the experiment below, the fundamental frequencies vary from 80 Hz
(bassoon) to 1600 Hz (flute). The above proposed features represent
a good description of the spectral envelope for low fundamental fre- 
quency sources. When considering sounds with high fundamental 
frequency, the descriptor represents a mixture of information of the 
envelope and fundamental frequency together. This happens because 
a non-adaptive linear operation can't separate them for a very wide 
range of fundamental frequencies. At first glance this may appear to be
a drawback, but in fact it becomes an advantage, as it includes some
information about the fundamental frequency in the descriptor while still
keeping a reasonable dictionary size.

We used the Development Set for the MIREX 2007 MultiF0 Estimation
Tracking Task,^4 which consists of a 52-second-long musical
piece played by five different wind instruments (bassoon, clarinet, 
flute, horn and oboe), and a set of tracks where each instrument plays 
individually. We used the first half of these audio tracks for training, 
and the rest for testing. In some passages of this piece, the instru- 
ments are arranged harmonically (forming chords), meaning that the 
notes one plays are partials of the fundamental note played by others. 
Thus, in these passages, the partials of the intervening instruments 
superimpose. 

The experiment for this case is analogous to the one with speak- 
ers, with the testing tracks divided in frames of 3 seconds each. The 
results are shown in Figure 2. The average Hamming distance between
the identified sources and the ground truth for C-HiLasso and
Lasso was respectively 0.16 and 2.46 when only one source was ac- 
tive, and 0.18 and 2.76 for all combinations of two sources, for a 
total average of 0.17 and 2.56. Once again, this demonstrates the 
power of C-HiLasso in collaboratively identifying the correct instru- 
ments (classes or sub-dictionaries). The hierarchical component is 
critical since, while all the signals share the active instruments (and 
the sub-dictionaries), each time frame is different, meaning they are 
represented using different atoms of the detected sub-dictionaries. 



^2 We only write the positive part of the spectrum.
^3 The dataset is available from the authors upon request.
^4 http://www.music-ir.org/mirex/wiki/
Fig. 2. Results for audio source identification. The leftmost table shows single source detection results for a simple k-NN classifier (with k = 5 and 25)
and the sparse model classifier presented in [17]. The center and right figures show identification results obtained with C-HiLasso when the sources are
speakers (center) and wind instruments (right). Each column of the graph corresponds to the sources identified for a specific time frame, with the true ones
marked by yellow dots. The vertical axis indicates the estimated activity of the different sources, where darker colors indicate higher energy. In the speaker
identification case, we have 10 frames (15 seconds of audio each) for each possible combination and in the instrument case, 8 frames (3 seconds each). For speakers we
used $(\lambda_1, \lambda_2) = (0.8, 0.008)$ and for the instruments $(\lambda_1, \lambda_2) = (0.8, 0.015)$. We observe how the number and type of classes are correctly identified.



4. SOURCE SEPARATION IN TEXTURE IMAGES 

Sparse modeling has been used to address the source separation problem
in images in [18, 19]. These methods are designed
for non-collaborative separation of mixtures of two given classes. In
this section, we explore the capabilities of C-HiLasso for source separation
in images which are mixtures of a few out of several possible
textures drawn from the Brodatz dataset, Figure 3. The columns of
$X$ contain the pixel values of all possible square windows of $10 \times 10$
pixels in the mixture image as $m = 10^2$ dimensional vectors. The
sub-dictionaries $D_G$ for each texture source were obtained off-line
from training samples taken from the left halves of the texture im- 
ages, while the samples used in the tests were taken from the right 
halves. Clearly, if the image is a mixture of a number of source tex- 
ture images, then every sample will also be a mixture of the same 
corresponding classes in the source images, and the hypothesis of 
C-HiLasso will hold. The experiment was repeated for all possible 
28 combinations of 2 out of 8 possible source textures. In terms of 
detected groups, the C-HiLasso achieves near perfect performance, 
with an average Hamming error between the true and estimated ac- 
tive sets of 0.14. Lasso is clearly not designed for this task, yielding 
an average Hamming error of 2.8. The best average PSNR (APSNR) 
obtained with C-HiLasso for all combinations was 23.7 dB, which is
2 dB higher than the 21.7 dB obtained with Lasso. The C-GLasso obtains
a Hamming error of 0.62 (three times larger than C-HiLasso),
and gives a significantly lower APSNR of 19.8 dB, clearly showing
that this model is not well suited for representing the data.
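
The construction of $X$ from a mixture image and the PSNR used to score the recovered textures can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, and the peak value of 255 presumes 8-bit images, which is not stated in the text.

```python
import numpy as np

def extract_patches(image, size=10):
    """Stack every size x size window of a 2-D image as a column of X."""
    h, w = image.shape
    cols = [image[r:r + size, c:c + size].reshape(-1)
            for r in range(h - size + 1)
            for c in range(w - size + 1)]
    return np.array(cols, dtype=float).T

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal size."""
    diff = np.asarray(reference, dtype=float) - np.asarray(estimate, dtype=float)
    mse = np.mean(diff ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```

With windows of $10 \times 10$ pixels each column has $m = 100$ entries, and averaging `psnr` over the recovered sources and over all mixtures would give a figure comparable to the APSNR reported above.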

We conclude that C-HiLasso is efficient for collaboratively iden- 
tifying sources in a set of mixture signals. The framework is capable 
of identifying sources in audio and identifying and recovering mixed 
sources in images, automatically detecting the number of sources present
in the mixture.